class: center, middle, inverse, title-slide

.title[
# Simple linear regression
]
.subtitle[
## Lecture 1
]
.author[
### Manuel Villarreal
]
.date[
### 08/26/24
]

---

### Blood pressure

- Researchers at UT Austin want to study the relationship between systolic blood pressure and age. They design a study to measure the blood pressure of 200 individuals alongside other variables of interest.

--

- Let's first look at the data.

---

### Data: blood pressure
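- The course data are not reproduced here; as a stand-in, the sketch below simulates a hypothetical sample of 200 individuals. The variable names and generating values are assumptions for illustration, not the actual dataset:

```python
# Hypothetical sketch (NOT the course data): simulate n = 200 individuals
# with an age and a systolic blood pressure, then peek at the first rows.
import random

random.seed(1)

n = 200
data = []
for _ in range(n):
    age = random.uniform(20, 80)
    # assumed linear relationship plus noise, for illustration only
    bp = 100 + 0.5 * age + random.gauss(0, 5)
    data.append({"age": round(age, 1), "blood_pressure": round(bp, 1)})

# print the first few rows, analogous to glancing at the head of a data frame
for row in data[:5]:
    print(row)
```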
---

### Data visualization

.panelset[
.panel[.panel-name[blood pressure]
<img src="data:image/png;base64,#lecture-1_files/figure-html/hist-bp-1.png" width="504" style="display: block; margin: auto;" />
]
.panel[.panel-name[Age]
<img src="data:image/png;base64,#lecture-1_files/figure-html/hist-age-1.png" width="504" style="display: block; margin: auto;" />
]
.panel[.panel-name[Scatter plot]
<img src="data:image/png;base64,#lecture-1_files/figure-html/scater-bp-age-1.png" width="504" style="display: block; margin: auto;" />
]
]

---

### Blood pressure vs age

- From the scatter plot, a linear function looks like a good starting point for studying the association between our variables of interest.

--

- The simple linear model can be expressed as:

`$$Y_i = \beta_0 + \beta_1 X_{i1} + \epsilon_i$$`

---

### Notation:

`$$Y_i = \beta_0 + \beta_1 X_{i1} + \epsilon_i$$`

- `\(i \in \{1, 2, \dots, n\}\)` indexes the observations.

--

- `\(Y_i\)` is the **i-th** observation or outcome.

--

- `\(X_{i1}\)` is the predictor for the **i-th** observation.

--

- `\(\beta_0\)` is the intercept: the expected value of the outcome for an observation whose predictor equals 0.

--

- `\(\beta_1\)` is the slope: the change in the expected value of the outcome for a one-unit difference in the predictor.

---

### Linear regression

- The simple linear model has two components:

--

1. Systematic component (**aka** mean model):
`$$\mu_i = \beta_0 + \beta_1 X_{i1}$$`

--

1. Random component (**aka** error term):
`$$\epsilon_i$$`

--

- For now we will assume that:

`$$\epsilon_i \overset{iid}{\sim} N(0,\sigma^2)$$`

--

- This assumption makes things easier for now; later we will see that it can be relaxed somewhat.

---

### What now?
- With our current specification we can express a participant's blood pressure as:

`$$\text{blood pressure}_i = \beta_0 + \beta_1 \text{age}_{i} + \epsilon_i$$`

--

- Notice that we don't know the values of `\(\beta_0\)`, `\(\beta_1\)`, or `\(\epsilon_i\)`. If we had access to the entire population this would be easy: we would just measure everyone's blood pressure and age to find those values.

--

- Measuring an entire population is rarely feasible, so instead we estimate these unknown values, known as *parameters*, from a sample of the population.

---

### Parameter estimation

- How can we estimate the parameters of a model?

--

- We can think of estimation as a decision problem where we want to choose one or more values from a set of candidates.

--

- For example, we could start by guessing that the values are: `\(\beta_0 = 100, \quad \beta_1 = 0.5, \quad \sigma^2 = 2\)`

--

- We can compare this first attempt with our data.

---

### Example line

`$$\mathrm{E}\left(\text{blood pressure}_i\right) = 100 + 1/2\ \text{age}_{i}$$`

<img src="data:image/png;base64,#lecture-1_files/figure-html/example-line-1.png" width="504" style="display: block; margin: auto;" />

---

## Example line

- The previous plot showed that many of our observations lie above the line, so what can we do next?

--

- We can frame the problem of estimating the parameters of a model as a decision problem!

--

- In this case, we want to find the parameter values that optimize some criterion or function, which depends on our candidate values for each parameter in the model.

--

- We refer to these functions as **objective functions**.

--

**Note:** The derivative of an objective function also has a name: the **score function**.

---

### Objective functions

- There are many objective functions we could choose from (we will use several different ones during this course).
--

- For now we will use the Mean Squared Error (MSE) as our objective function.

--

- According to this criterion, we want to find the values `\(\hat{\beta}_0\)` and `\(\hat{\beta}_1\)` that minimize the MSE:

`$$\underset{\hat{\beta}_0, \hat{\beta}_1}{\operatorname{min}}\ \frac{1}{n}\sum_{i = 1}^{n} \left(y_i - \hat{\beta}_0 - \hat{\beta}_1x_{i1}\right)^2$$`

--

- This objective function has an intuitive interpretation and visualization.

---

### Visualization of MSE

<iframe width="1200" height="470" src="https://www.geogebra.org/m/XUkhCJRj"></iframe>

---

### Solution: Normal equations

- We call the solution to this optimization problem the Ordinary Least Squares estimator, or **OLS** for short.

--

- For simple linear regression, the OLS estimates can be expressed as:

`$$\hat{\beta}_0 = \bar{y} - \hat{\beta}_1\bar{x}_1$$`

`$$\hat{\beta}_1 = \frac{\sum_{i = 1}^n (y_i - \bar{y})(x_{i1} - \bar{x}_1)}{\sum_{i = 1}^n (x_{i1}-\bar{x}_1)^2}$$`

---

### Estimation of the variance

- To estimate the variance we use the unbiased estimator:

`$$\hat{\sigma}^2 = \frac{1}{n-p} \sum_{i = 1}^n (y_i - \hat{\mu}_i)^2$$`

--

- where `\(p\)` is the number of "betas" in the model and

`$$\hat{\mu}_i = \hat{\beta}_0 + \hat{\beta}_1x_{i1}$$`

---

class: center, middle, inverse

## Now it's your turn!
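---

### OLS in code (sketch)

- The closed-form OLS estimates can be checked numerically. This is a minimal sketch on simulated data; the "true" parameter values and variable names are assumptions for illustration, not the course dataset:

```python
# Sketch: compute the OLS estimates from the normal equations on simulated
# data and compare them with the values used to generate it.
import random

random.seed(1)
n = 200
b0_true, b1_true = 100.0, 0.5      # assumed "true" parameters (illustrative)

x = [random.uniform(20, 80) for _ in range(n)]
y = [b0_true + b1_true * xi + random.gauss(0, 2) for xi in x]

xbar = sum(x) / n
ybar = sum(y) / n

# slope: sum of cross-deviations over sum of squared deviations in x
b1_hat = (sum((yi - ybar) * (xi - xbar) for xi, yi in zip(x, y))
          / sum((xi - xbar) ** 2 for xi in x))
# intercept follows from the first normal equation
b0_hat = ybar - b1_hat * xbar

# unbiased variance estimate with p = 2 estimated betas
resid = [yi - (b0_hat + b1_hat * xi) for xi, yi in zip(x, y)]
sigma2_hat = sum(e ** 2 for e in resid) / (n - 2)

print(b0_hat, b1_hat, sigma2_hat)
```

- With `\(n = 200\)` the estimates should land close to the generating values, which is a quick sanity check of the formulas.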